PROC SURVEYSELECT as a Tool for Drawing Random Samples
ABSTRACT
This paper illustrates some of the many sampling algorithms built into PROC SURVEYSELECT, particularly those pertinent to complex surveys, such as systematic, probability proportional to size (PPS), stratified, and cluster sampling. The primary objectives of the paper are to provide background on why these techniques are used in practice and to demonstrate their application via syntax examples. Hence, this is not a how-to paper on designing a statistically efficient sample; there are entire textbooks devoted to that subject. One exception is that the paper will discuss a few recently incorporated sample allocation strategies: specifically, proportional, Neyman, and optimal allocation. The paper concludes with a few examples demonstrating how one can use PROC SURVEYSELECT to handle certain frequently encountered sample design issues, such as alternative sampling methods across strata and multi-stage cluster sampling.

INTRODUCTION

The purpose of this paper is to highlight some of the capabilities of PROC SURVEYSELECT in SAS® to facilitate the task of drawing a random sample. Although the various sampling approaches will be introduced with moderate amounts of background and discussion regarding why and how they are used in practice, the intent is to demonstrate the necessary PROC SURVEYSELECT syntax to carry out these techniques. For an accessible introduction to the issues and concepts of survey sampling, see Kalton (1983). The reader seeking a more in-depth treatment of the underlying theory involved in designing an efficient complex survey sample is referred to one of the many excellent texts on the subject, such as Kish (1965), Cochran (1977), Scheaffer, Mendenhall, and Ott (1996), Lohr (1999), or Valliant, Dever, and Kreuter (2013), to name a few.

This paper is structured into three main sections.
The first touches on three fundamental random sampling techniques: (1) simple random sampling; (2) systematic sampling; and (3) probability proportional to size sampling. The second section discusses how the STRATA statement can be used to apply these techniques independently within two or more strata defined on the sampling frame. The third section illustrates how the CLUSTER statement can be used to select a random sample of groups of the underlying units of analysis.

To motivate exposition of the various sampling techniques, suppose a market research firm has been hired to evaluate the spending habits of N = 2,000 adults living in a small city. Two example statistics of interest are the average amount of money an adult spent during the previous year on over-the-counter (OTC) medications and the average amount spent on travel outside the city. Assume the data set FRAME is the sample frame containing one record for each of the N = 2,000 unique adults in the population and consisting of the following variables:

ADULTID – a numeric variable ranging from 1 to 2,000 that uniquely identifies each adult.
BLOCKID – a numeric variable ranging from 1 to 100 that denotes the distinct block on which each adult lives. All C = 100 blocks in FRAME consist of exactly Nc = 20 adults.
CITYSIDE – a character variable with two possible values, “East” or “West”, that indicates on which side of a river bisecting the city a given block falls.
INCOME – an aggregate measure of income for each adult during the most recent year, which we will assume has been obtained from the local taxing authority.

Over the course of the paper, a variety of progressively more complex sample designs will be demonstrated, each with the fixed sample size of n = 400. There is nothing particularly noteworthy about this number, but perhaps we can think of it as the maximum sample size permitted by the market research firm’s data collection budget.
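
Readers who wish to run the examples need a data set shaped like FRAME. The paper does not show how the frame was assembled, so the DATA step below is only a sketch under stated assumptions: the block-to-side split (blocks 1 to 50 East, 51 to 100 West), the simulated INCOME distribution, the seed, and the loop counter _i are all hypothetical choices made for illustration.

```sas
/* Hypothetical stand-in for FRAME; every generating rule is an assumption */
data frame;
   length cityside $4;
   call streaminit(20130);              /* arbitrary seed for reproducibility */
   do blockid = 1 to 100;               /* C = 100 blocks */
      /* assumed split: blocks 1-50 east of the river, 51-100 west */
      if blockid <= 50 then cityside = 'East';
      else cityside = 'West';
      do _i = 1 to 20;                  /* Nc = 20 adults per block */
         adultid + 1;                   /* running identifier 1, 2, ..., 2000 */
         /* simulated right-skewed income; not real tax data */
         income = round(exp(rand('normal', 10.5, 0.6)), 1);
         output;
      end;
   end;
   drop _i;
run;

/* Sanity check: under the assumed split, each side holds 1,000 adults */
proc freq data=frame;
   tables cityside / nocum;
run;
```

Because the data set is written in ADULTID order within BLOCKID, its sort order is fixed, which matters for any selection method that depends on the order of the input records.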
FUNDAMENTAL SAMPLING TECHNIQUES

SIMPLE RANDOM SAMPLING

The example syntax below shows how to conduct the most basic (and default) method available in PROC SURVEYSELECT, simple random sampling without replacement (SRSWOR). All of the options utilized appear in the PROC statement. The DATA= option points to the sample frame, while the OUT= option names the output data set SAMPLE_SRSWOR that will house the resulting sample. The SAMPSIZE= option is used to declare a sample size of n = 400. Assigning a random number with the SEED= option ensures the same sample will be selected if the PROC SURVEYSELECT syntax is resubmitted at a later time, assuming an equivalent input data set in the same sort order. Although specifying a seed is technically optional, it is generally good practice to do so. As the reader will observe, each PROC SURVEYSELECT step demonstrated in this paper specifies a unique seed.

   proc surveyselect data=frame out=sample_SRSWOR sampsize=400 seed=40029;
   run;

The real “output” from PROC SURVEYSELECT is the SAMPLE_SRSWOR data set consisting of 400 observations drawn randomly from FRAME, but a brief rundown of what occurred is reported in the listing. For example, the summary generated from the syntax submitted above appears below. For brevity, this is the only occasion output such as this will be included in the paper.

There is an implicit METHOD=SRS option in the PROC statement in the example above. A variety of alternative randomized sampling schemes are available. For instance, to request simple random sampling with replacement (SRSWR), we can specify METHOD=URS. (URS stands for unrestricted random sampling.) The following example illustrates the syntax to conduct this method for the same sample size of 400.

   proc surveyselect data=frame out=sample_SRSWR sampsize=400 seed=22207 method=URS outhits;
   run;

Aside from specifying METHOD=URS and a new seed, the SRSWR syntax is very similar to that of SRSWOR.
One exception is the OUTHITS option appearing in the PROC statement. Whenever a with-replacement design is specified by the user, the default output data set consists of one row for each unique record sampled and a numeric variable called NUMBERHITS indicating how many times the particular record was chosen. There may be occasions when this is preferable, but the OUTHITS option requests that a separate record be output for each selection, forcing the number of rows in the output data set to match the sample size. The NUMBERHITS variable is still retained in the OUT= data set, however.

SYSTEMATIC RANDOM SAMPLING

Another widely used approach in applied survey research is systematic sampling. The basic idea is to select every kth unit into the sample, where k is typically an integer. This method is particularly useful in scenarios where it would be prohibitively difficult or altogether impossible to construct a sample frame. Consider a doctor’s office that maintains each patient’s information in a physical folder sorted alphabetically by surname. If the sampling unit is the patient, it would be much easier to sample, say, every 50th folder than to enumerate all folders, draw a sample, and assemble the list of folders by retrieving them one at a time. Another example is a customer satisfaction survey for a grocery store. It is improbable that an exhaustive list of patrons exists, so one rational method of random sampling would be to try to engage every 20th customer leaving the store. It would also be wise to randomly assign the days and times of day during which these attempts are made.

The technique can also be used when a well-defined sample frame exists, as is the case with the hypothetical expenditure survey. The example below demonstrates the basic syntax for selecting a systematic sample of n = 400 adults. Specifying METHOD=SYS in the PROC statement initiates this selection technique.
PROC SURVEYSELECT calculates the sampling interval k as N / n, where N is inferred from the number of observations in the input data set. If k is not an integer, a fractional interval is used such that the exact sample size requested is returned (see the documentation for more details). For ease of exposition, we intentionally allow for an integer interval of k = 2000 / 400 = 5. In essence, PROC SURVEYSELECT begins by randomly choosing a starting point between the 1st and kth observation in the input data set. We might denote this observation r. The sample will consist of the rth observation and the (r + k)th, (r + 2k)th, (r + 3k)th, etc., on down through the end of the data set. For instance, suppose the first adult selected is the 4th. The sample in the output data set SAMPLE_SYS would consist of this individual followed by the 9th, 14th, 19th, ..., and 1,999th.

   proc surveyselect data=frame out=sample_SYS sampsize=400 seed=65401 method=SYS;
   run;

The advantages and disadvantages of systematic sampling are laid out plainly in Chapter 8 of Cochran (1977). One salient disadvantage is that there is no “standard” variance formula adaptation to be applied, as there is with stratified or clustered sampling. Cochran illustrates scenarios where the sample could behave like a stratified or clustered sample. The former is

The SURVEYSELECT Procedure

   Selection Method         Simple Random Sampling

   Input Data Set           FRAME
   Random Number Seed       40029
   Sample Size              400
   Selection Probability    0.2
   Sampling Weight          5
   Output Data Set          SAMPLE_SRSWOR
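
The integer-interval rule described in this section (a random start r between 1 and k, then every kth record) can be sketched by hand in a DATA step. This is only an illustration of the logic under the assumption that FRAME has 2,000 records and k = 5; it is not PROC SURVEYSELECT's own implementation, it does not handle fractional intervals, and the data set name SAMPLE_SYS_MANUAL and the macro variable K are hypothetical.

```sas
/* Manual systematic selection: illustrates the logic only */
%let k = 5;                              /* interval N / n = 2000 / 400 */

data sample_sys_manual;
   retain r;
   if _n_ = 1 then do;
      call streaminit(65401);            /* arbitrary seed */
      /* random start between 1 and k; rand('uniform') lies strictly in (0,1) */
      r = ceil(&k * rand('uniform'));
   end;
   set frame;
   /* keep records r, r + k, r + 2k, ...; _n_ equals the record number here */
   if mod(_n_ - r, &k) = 0;
   drop r;
run;
```

With 2,000 records and k = 5, any start r from 1 to 5 yields exactly 400 selected records. The records chosen will generally differ from those in SAMPLE_SYS even with the same seed value, since this sketch draws its random start through a different mechanism than the procedure does.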